Red Wine Exploration by Ariel Ma

This report explores a dataset of 1599 red wine samples. The original dataset describes each sample with 12 attributes. The quality attribute is a grade assigned by red wine experts based on sensory data, on a scale from 0 (very bad) to 10 (excellent). For more details about the data set, please refer to:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Univariate Plots Section

Data Structure

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

I will be looking at the distribution of each attribute of red wine.

The quality distribution shows that most red wines have a quality between 5 and 7, with the lowest wine graded 3 and the highest graded 8. No wine is graded below 3 or above 8.

According to the data description, fixed acids make up most of the acids in wine and do not evaporate readily. The distribution of fixed acidity across the 1599 wines is slightly right skewed, with a mean of 8.32g/dm^3 and a median of 7.9g/dm^3.

Volatile acidity at too high a level can lead to an unpleasant vinegar taste, so I expect high quality wines not to contain much volatile acid. The distribution shows that the majority of wines have a volatile acidity between 0.2 and 0.8g/dm^3. Very few wines have a volatile acidity above 1.0g/dm^3.

Citric acid can add ‘freshness’ and flavor to wines. The number of wines decreases as the citric acid level goes up, although there are peaks around 0, 0.23 and 0.5g/dm^3.

Since the amount of fixed acidity is far larger than the amounts of volatile acidity and citric acid, the distribution of total acidity is almost the same as that of fixed acidity. I am interested to see later on how pH varies with the amount of each type of acid.
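The sum behind total acidity can be sketched in a few lines. This is only an illustration on the first five samples shown in the str() output above; the data frame name wine_head and the helper fixed_share are names I made up for this sketch, not part of the original analysis code.

```r
# First five samples, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00)
)

# total.acidity is the derived column described in the text:
# the plain sum of the three acid measurements (all in g/dm^3).
wine_head$total.acidity <- with(
  wine_head,
  fixed.acidity + volatile.acidity + citric.acid
)

# Share of the total that comes from fixed acidity in each sample.
fixed_share <- wine_head$fixed.acidity / wine_head$total.acidity
print(round(fixed_share, 2))
```

In these five samples fixed acidity accounts for roughly nine tenths of the total, which is why the two distributions look so alike.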

Residual sugar is the sugar remaining after fermentation stops. Most wines have residual sugar between 1 and 45 g/liter. In our dataset, the majority of wines fall in the range 1.5 to 3 g/dm^3, with a mean of 2.54g/dm^3 and a median of 2.2g/dm^3.

The x axis here shows chlorides - that is, the amount of salt - in the wine. Most of the wines have salt in the range 0.05 to 0.1g/dm^3. There are outliers with more than 0.3g/dm^3 of chlorides.

Free sulfur dioxide helps to prevent microbial growth and the oxidation of wine. By changing the bin width, it seems that the free sulfur dioxide amounts in the wine collection are all integers in mg/dm^3, with a mean of 15.87mg/dm^3 and a median of 14mg/dm^3.
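A histogram can only hint that values are whole numbers, so a direct check is a useful complement. The sketch below runs on the ten free sulfur dioxide values visible in the str() output, so it cannot confirm the claim for all 1599 samples; the variable names are hypothetical.

```r
# First ten free.sulfur.dioxide values (mg/dm^3) from the str() output.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# A value is a whole number if rounding leaves it unchanged.
is_whole  <- free_so2 == round(free_so2)
all_whole <- all(is_whole)
print(all_whole)

# Counting each distinct value plays the role of a bin width of 1.
counts <- table(free_so2)
print(counts)
```

Running the same check on the full column would show directly whether any non-integer measurements exist.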

Total sulfur dioxide represents the amount of free and bound forms of SO2. SO2 becomes evident in the nose when the free SO2 concentration is over 50 ppm.

After changing the bin size and removing outliers, it seems that total sulfur dioxide is also recorded as integers, with a mean of 46.47mg/dm^3 and a median of 38mg/dm^3.

The density of wine depends on its alcohol and sugar content. Density is approximately normally distributed, with mean and median around 0.997g/cm^3.

pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). The pH of the wine samples also shows an approximately normal distribution, with mean and median around 3.3.

Sulphate is a wine additive that can contribute to sulfur dioxide gas levels; sulfur dioxide acts as an antimicrobial and antioxidant. The distribution of sulphates is right skewed, with a mean of 0.66g/dm^3 and a median of 0.62g/dm^3.

The alcohol attribute shows the alcohol content of the wine sample in percent by volume. The mean alcohol content of the wine samples is 10.42% and the median is 10.2%.
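The skew statements above can be cross-checked against the summary table without looking at any plot. The sketch below uses a simple heuristic: the gap between mean and median, scaled by the interquartile range. The numbers are copied from the summary() output; the 0.1 cut-off and all variable names are my own assumptions, not a standard rule.

```r
# Quartiles and means copied from the summary() output above.
stats <- data.frame(
  attribute = c("fixed.acidity", "residual.sugar", "sulphates", "density", "pH"),
  q1        = c(7.10, 1.900, 0.5500, 0.9956, 3.210),
  median    = c(7.90, 2.200, 0.6200, 0.9968, 3.310),
  mean      = c(8.32, 2.539, 0.6581, 0.9967, 3.311),
  q3        = c(9.20, 2.600, 0.7300, 0.9978, 3.400)
)

# Mean minus median, expressed as a fraction of the interquartile range.
stats$scaled_gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Assumed cut-off of 0.1: a mean well above the median hints at a right skew.
stats$hint <- ifelse(
  stats$scaled_gap > 0.1, "right skewed?",
  ifelse(stats$scaled_gap < -0.1, "left skewed?", "roughly symmetric?")
)
print(stats[, c("attribute", "scaled_gap", "hint")])
```

Under this heuristic fixed acidity, residual sugar and sulphates come out as right skewed, while density and pH look roughly symmetric, which agrees with the histograms.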

Univariate Analysis

What is the structure of your dataset?

The dataset has 1599 wine samples and each sample has 12 associated attributes. I have created an additional column named total.acidity to represent the sum of the different types of acid. All attributes and their units of measurement are listed below:
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. total acidity (sum of fixed acidity, volatile acidity and citric acid)
5. residual sugar (g / dm^3)
6. chlorides (sodium chloride - g / dm^3)
7. free sulfur dioxide (mg / dm^3)
8. total sulfur dioxide (mg / dm^3)
9. density (g / cm^3)
10. pH
11. sulphates (potassium sulphate - g / dm^3)
12. alcohol (% by volume)
13. quality (numeric column with values between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

I am interested to see the relationship between citric acid and quality, between residual sugar and quality, and the relationship between alcohol and quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Since density depends on the amount of alcohol and sugar in the wine, I will be looking at the relationship between density and alcohol and the relationship between density and residual sugar.

Did you create any new variables from existing variables in the dataset?

I have created a column total.acidity for observing the relationship between total acidity and quality, and between total acidity and pH.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I have added total.acidity as the sum of the different types of acidity, so that it is easy to see how each type of acidity and their combined amount change as the quality of wine changes. I have also converted the quality attribute from an integer to a factor to make it categorical, in case this helps with the analysis in the following sections.
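The factor conversion can be sketched as follows, using the first ten quality grades from the str() output. Treating the factor as ordered and fixing the levels to 3 through 8 (the observed minimum and maximum) are assumptions of this sketch; the original code may have used a plain factor.

```r
# First ten quality grades from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumption: an ordered factor whose levels span the observed grades 3..8.
quality_factor <- factor(quality, levels = 3:8, ordered = TRUE)

# Unused grades still appear in the table, with a count of zero.
grade_counts <- table(quality_factor)
print(grade_counts)

# Caution: as.integer() on a factor returns level codes (1..6), not grades.
codes  <- as.integer(quality_factor)
grades <- as.integer(as.character(quality_factor))
```

Going through as.character() first is what recovers the original grades, which matters later when quality is used as a numeric response.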

Bivariate Plots Section

Pair Plot

We can observe from the pair plot that alcohol has the strongest positive correlation with quality. It seems that the experts prefer wines with more alcohol. Volatile acidity shows the strongest negative correlation with quality, which indicates that the more volatile acid there is in a wine, the lower its quality.

Although it seems acidity has a big impact on wine quality, sweetness does not seem to have much meaningful impact. Residual sugar has a very weak relationship with quality, with a correlation coefficient of 0.01.

Sulphates also have a positive correlation with quality, with a correlation coefficient of 0.25.

To further examine the correlations, I have created plots to show correlation between each one of the supporting variables and quality.

Each plot shows the trend of the mean of a supporting variable across quality grades. For example, the mean of fixed acidity shows an increasing trend as the quality of wine increases. It seems that the decrease in volatile acidity and the increase in citric acid offset each other as quality increases, so that the trend of total acidity follows the trend of fixed acidity. To examine this further:

The above plot shows the trend of the amount of each acid, and of the total amount of acid, as quality increases.

We can see that fixed acidity and total acidity have almost the same trend. Citric acid and volatile acidity have opposite trends, and we can expect similar amounts of citric acid and volatile acidity in high quality wines.
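The per-grade means behind these trend plots can be computed with aggregate(). The sketch below uses only the first ten samples from the str() output, so its numbers are illustrative and do not reproduce the plotted trends; wine_head and means_by_quality are hypothetical names.

```r
# First ten samples from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean of each acid measurement within each quality grade.
means_by_quality <- aggregate(
  cbind(volatile.acidity, citric.acid) ~ quality,
  data = wine_head,
  FUN  = mean
)
print(means_by_quality)
```

With the full dataset, each column of the result gives one trend line: plotting it against quality shows whether the mean rises or falls from grade to grade.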

What also draws my attention is that there seems to be a strong negative relationship between chlorides and quality, although in the pair plot the correlation value is only -0.12.

The box plot shows that there is no obvious correlation between chlorides and quality. There are some outliers among wines graded between 4 and 7. However, the majority of wines, including the highest graded ones, have chlorides of 0.05 to 0.1g/dm^3.

Correlation Matrix

Apart from the relationships between quality and each of the supporting variables, I am also interested in the relationships between pairs of supporting variables. From the correlation matrix, we can observe a strong positive correlation between fixed acidity and total acidity, which is expected because fixed acidity makes up most of the acid in wines. It also shows that the more acid in the wine, the denser the wine is. Similarly, the more residual sugar, the denser the wine is. In contrast, the more alcohol in the wine, the less dense it is.
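A correlation matrix like the one discussed above can be produced with cor(). The sketch below works on the first ten samples from the str() output and leaves out density, because only five density values are visible there. The coefficients it prints therefore differ from the full-data matrix; wine_head and fixed_vs_total are hypothetical names.

```r
# First ten samples from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)
wine_head$total.acidity <- with(
  wine_head,
  fixed.acidity + volatile.acidity + citric.acid
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_head)
print(round(cor_matrix, 2))

# The fixed vs total acidity entry discussed in the text.
fixed_vs_total <- cor_matrix["fixed.acidity", "total.acidity"]
```

Even on ten rows the fixed versus total acidity entry is close to 1, since total acidity is fixed acidity plus two much smaller terms.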

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

As the quality of wine increases, the amount of volatile acidity decreases and the amount of citric acid increases. The plots suggest that these two acids offset each other, leaving the total amount of acid in wines with the same trend as fixed acidity.

Sulphates also show an increasing trend as quality goes up.

The plots also show that density and acidity have a positive correlation. This might be because acid dissolved in the wine makes it denser.

What was the strongest relationship you found?

Alcohol has the strongest positive relationship with quality, whereas volatile acidity has the strongest negative relationship with quality.

Multivariate Plots Section

I will start this section with a summary of the observations made in the previous sections:
1. The more acid, the more dense the wine is;
2. The more alcohol, the less dense the wine is;
3. Alcohol has the strongest positive correlation with quality;
4. Volatile acidity has the strongest negative correlation with quality;
5. The amount of Sulphate increases as quality goes up;
6. The better red wine tends to have more citric acid in it.

I am interested to see if these observations can be further supported by multivariate plots.

Multivariate Analysis

The better wines tend to be less dense, have more acid and have more alcohol. Since density drops as alcohol increases and rises as acid increases, we can tell that as quality goes up, the effect of the increase in alcohol outweighs the effect of the increase in acid, and this makes the density decrease.

To see more clearly how alcohol impacts the quality of wines, let’s build a linear model and check the statistics:

## 
## Call:
## lm(formula = quality ~ alcohol, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  1.87497    0.17471   10.73 <0.0000000000000002 ***
## alcohol      0.36084    0.01668   21.64 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 0.00000000000000022

The F-statistic is 468.3 with a p-value below 2.2e-16, which indicates a statistically significant relationship between alcohol and quality. However, the low R^2 value (0.23) shows that alcohol alone explains only about 23% of the variance in quality, so this linear model may not be a very good model for predicting wine quality.
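To show where these statistics come from, here is a minimal sketch of the same kind of fit on the first ten samples from the str() output. It does not reproduce the coefficients above, which need all 1599 rows, and quality is treated as numeric here. The names wine_head, fit and fit_summary are hypothetical.

```r
# First ten samples from the str() output; quality is kept numeric for lm().
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Simple linear regression of quality on alcohol.
fit         <- lm(quality ~ alcohol, data = wine_head)
fit_summary <- summary(fit)

# The two statistics discussed in the text.
r_squared <- fit_summary$r.squared    # share of variance explained
f_stat    <- fit_summary$fstatistic   # named vector: value, numdf, dendf

# With a single predictor, R^2 is the squared Pearson correlation.
r_from_cor <- cor(wine_head$alcohol, wine_head$quality)^2
print(c(r_squared = r_squared, r_from_cor = r_from_cor))
```

The last identity also applies to the full model: an R^2 of 0.2267 corresponds to a correlation of about 0.476 between alcohol and quality.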

Is it possible that the model will predict wine quality better if we add more variables?

Although it is not as obvious as the correlation between alcohol and quality, we can also see that good quality wines tend to have a higher concentration of sulphates. So what if we add sulphates as another variable when building a linear model to predict wine quality? Would alcohol and sulphates together be a better predictor?

## 
## Call:
## lm(formula = quality ~ alcohol + sulphates, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6685 -0.3781 -0.1005  0.4992  2.4187 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  1.37497    0.17745   7.748   0.0000000000000164 ***
## alcohol      0.34604    0.01628  21.256 < 0.0000000000000002 ***
## sulphates    0.99409    0.10235   9.713 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6905 on 1596 degrees of freedom
## Multiple R-squared:  0.2699, Adjusted R-squared:  0.269 
## F-statistic:   295 on 2 and 1596 DF,  p-value: < 0.00000000000000022

We can see that the R^2 of the multiple linear model with alcohol and sulphates as predictors (0.27) is slightly higher than that of our linear model of quality against alcohol alone (0.23).

Although we observed in the previous section that citric acid has a positive relationship with quality, we can tell from the above scatterplot that this trend is no longer obvious once alcohol is taken into account.

Looking at the linear model built from alcohol, sulphates and citric acid, the R^2 value increases slightly to 0.28, showing that the model is a little better than the one using only alcohol and sulphates.

## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + citric.acid, data = redwine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7565 -0.3535 -0.1007  0.5067  2.2125 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  1.43392    0.17615   8.140 0.000000000000000786 ***
## alcohol      0.33841    0.01619  20.903 < 0.0000000000000002 ***
## sulphates    0.81403    0.10651   7.643 0.000000000000036480 ***
## citric.acid  0.51345    0.09284   5.531 0.000000037217429663 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6842 on 1595 degrees of freedom
## Multiple R-squared:  0.2836, Adjusted R-squared:  0.2823 
## F-statistic: 210.5 on 3 and 1595 DF,  p-value: < 0.00000000000000022
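Because R^2 can never decrease when a predictor is added, adjusted R^2 is the fairer figure for comparing the three models. As a check, the sketch below recomputes adjusted R^2 from the reported R^2 values using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n = 1599 samples and p predictors. The data frame name models and its column names are hypothetical.

```r
n <- 1599  # number of wine samples

# R^2 and adjusted R^2 as printed in the three model summaries above.
models <- data.frame(
  model        = c("alcohol",
                   "alcohol + sulphates",
                   "alcohol + sulphates + citric.acid"),
  predictors   = c(1, 2, 3),
  r2           = c(0.2267, 0.2699, 0.2836),
  adj_reported = c(0.2263, 0.2690, 0.2823)
)

# Adjusted R^2 penalises each extra predictor.
models$adj_computed <- 1 - (1 - models$r2) * (n - 1) / (n - models$predictors - 1)

print(models[, c("model", "adj_reported", "adj_computed")])
```

The recomputed values agree with the reported ones up to rounding, and adjusted R^2 still rises with each added variable, so the improvement is not just an artefact of having more predictors.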

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Alcohol, sulphates and citric acid all have positive coefficients in the linear models, so each is associated with higher wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I have created 3 linear models. The first model, alcohol_quality_lm, uses only alcohol as a variable to predict wine quality. The second model, alcohol_sulphate_quality_lm, uses alcohol and sulphates to predict quality. The third model, alcohol_sulphate_citric_acid_quality_lm, uses alcohol, sulphates and citric acid to predict quality. Each model generates a slightly better prediction than the previous one. However, even the best model has a relatively low R^2 value of 0.28. I think there must be other models that can provide a far better prediction. Therefore I will not adopt any of these models for predicting wine quality.
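To make the limitation concrete, the third model can be written out as a formula using the coefficients reported in its summary. The sketch below is only an illustration: the function predict_quality is a hypothetical helper, not one of the fitted model objects, and it is applied to the first sample from the str() output and to the attribute means from the summary table.

```r
# Coefficients copied from the summary of the three-variable model above.
coefs <- c(intercept   = 1.43392,
           alcohol     = 0.33841,
           sulphates   = 0.81403,
           citric.acid = 0.51345)

# Hypothetical helper: the fitted linear formula written out by hand.
predict_quality <- function(alcohol, sulphates, citric.acid) {
  unname(coefs["intercept"] +
           coefs["alcohol"]     * alcohol +
           coefs["sulphates"]   * sulphates +
           coefs["citric.acid"] * citric.acid)
}

# First sample in the dataset (its actual grade is 5).
first_sample <- predict_quality(alcohol = 9.4, sulphates = 0.56, citric.acid = 0)

# Attribute means from the summary table; a least-squares fit with an
# intercept predicts the mean response at the mean of the predictors.
at_means <- predict_quality(alcohol = 10.42, sulphates = 0.6581, citric.acid = 0.271)

print(round(c(first_sample = first_sample, at_means = at_means), 2))
```

The prediction at the means lands close to the mean quality of 5.636, as expected. With a residual standard error of about 0.68, though, individual predictions are typically off by more than half a grade, which is large for grades that mostly fall between 5 and 7.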

Final Plots and Summary

Plot One

Description One

This plot shows the trend of the mean of each variable across quality grades. This graph provides a better visualisation than the correlation matrix for conveying the relationship between each variable and quality.

From the plot we can tell that volatile acidity and chlorides both have an obvious negative relationship with quality, whereas citric acid, sulphates and alcohol each show a positive correlation with quality. I inspected these variables more closely and tried to build linear models based on them in the Multivariate Analysis. Although the model gets slightly better at predicting wine quality as more variables are added, the best model, built from alcohol, sulphates and citric acid, still has an R^2 value (0.28) too small to be an ideal predictive model.

Plot Two

Description Two

This plot helps to understand how much each acid contributes to the total acidity. It also shows how the acids change as quality increases. Despite the increase in citric acid and the decrease in volatile acidity, fixed acidity follows almost exactly the same trend as total acidity. This shows that as quality goes higher, the changes in citric acid and volatile acidity offset each other.

Plot Three

Description Three

This 3D graph shows the relationship among density, alcohol and total acidity. We can see that higher quality wines tend to have more acid and more alcohol while being less dense.


Reflection

I explored the red wine data set by first drawing histograms of the distribution of each attribute. I then visualised the relations between each pair of variables by drawing the pair plot. To better understand how each ingredient changes with quality, I made a grid showing the trend of each variable as quality increases. Lastly, I built visualisations using multiple variables to further support and strengthen my observations. I did not adopt a linear model to predict the quality of wine, as the R^2 values are relatively low and none of the models can be considered a good one.

The biggest challenge in completing this analysis was translating my findings into graphs that convey the insight well. A lot of thought was involved, and I would say most of the time was spent thinking about how to organise the visualisations and make them easy to understand yet informative.

I wish the data set included the producing area and the producing year of each wine. More visualisations could be created - e.g. heatmap, time series plot - if these attributes were provided, and I think they would make both the analyst and the audience more engaged (“This wine is from my hometown!”). I would also like to apply machine learning to find a better model for predicting the quality of wine based on its attributes.